Stochastic Gradient Descent

Formula

$$ \begin{aligned} w^{(k+1)} &= w^{(k)} - \alpha^{(k)} \cdot \nabla_{w} \mathcal{L}^{(k)} \end{aligned} $$
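For concreteness, a minimal sketch of one update step (the squared-error loss and all variable names below are illustrative assumptions, not part of the note):

```python
import numpy as np

def sgd_step(w, grad, lr):
    """One descent step: move w against the gradient of the loss."""
    return w - lr * grad

# Example: gradient of the per-example loss 0.5 * (x @ w - y)**2
x, y = np.array([1.0, 2.0]), 3.0
w = np.zeros(2)
grad = (x @ w - y) * x        # per-example gradient at the current w
w = sgd_step(w, grad, lr=0.1)
```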

Standard Gradient Descent

$$ \begin{aligned} \nabla \mathcal{L} &= \nabla \frac{1}{N}\sum_{i=1}^{N} \ell(y_i, p_i) = \frac{1}{N}\sum_{i=1}^{N} \nabla \ell(y_i, p_i) \\ &\approx \nabla \ell(y_j, p_j) \end{aligned} $$

Stochastic Gradient Descent

$$ \begin{aligned} \mathbb{E}_{j \sim \text{Uniform}(N)}[\nabla \ell(y_j, p_j)] &= \frac{1}{N} \sum_{i=1}^{N} \nabla \ell(y_i, p_i) \end{aligned} $$
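A quick numerical check of the two identities above, using a squared-error loss on synthetic data (both are assumptions made purely for illustration): a single-example gradient is a cheap, noisy estimate of the full gradient, and averaging it over all $j \sim \text{Uniform}(N)$ recovers the full gradient exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1_000, 4
X, y = rng.normal(size=(N, d)), rng.normal(size=N)
w = rng.normal(size=d)

# Per-example gradients of 0.5 * (x_i @ w - y_i)**2, shape (N, d)
per_example = (X @ w - y)[:, None] * X

# Full-batch gradient: requires a pass over all N examples
full_grad = X.T @ (X @ w - y) / N

# A single-example gradient is a noisy estimate of full_grad ...
j = rng.integers(N)
single_grad = per_example[j]

# ... but its expectation under j ~ Uniform(N) is exactly full_grad
print(np.allclose(per_example.mean(axis=0), full_grad))   # True
```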

Each SGD update is much cheaper than a full GD update, which is what lets it scale to big data: the per-step cost does not grow with the number of training examples $N$. Stochastic gradient methods are the modus operandi for training deep networks.
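A minimal training-loop sketch that illustrates this (a least-squares model on synthetic data; the learning rate and step count are arbitrary assumptions): each step touches a single example, so its cost is independent of $N$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 10_000, 5
X = rng.normal(size=(N, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=N)

w, lr = np.zeros(d), 0.01
for k in range(5_000):
    j = rng.integers(N)                  # draw one example uniformly at random
    grad = (X[j] @ w - y[j]) * X[j]      # gradient of 0.5 * (x_j @ w - y_j)**2
    w -= lr * grad                       # O(d) per step, independent of N

print(np.linalg.norm(w - w_true))        # shrinks toward 0 (up to noise) as steps grow
```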

by Jon